In this project, I explore and analyze the white wine quality dataset. The dataset contains 4898 white wines with 11 variables quantifying different attributes of each wine. It also includes an output variable, the rating of each wine on a scale from 0 to 10. I will analyze the relationships between the wine attributes and the ratings, and explore whether there are any strong relationships among the attributes themselves.
In this section, I load the data. The variable names are shown below.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Now let’s see the structure of the variables:
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
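The loading and inspection steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the code used for this report: the file name `wineQualityWhites.csv` is an assumption, and so that the snippet runs without that file, it writes the first three rows from the `str()` output above to a temporary CSV and reads that back.

```r
# First three rows copied from the str() output above.
# Density is printed rounded there, so these are not the exact stored values.
sample_rows <- data.frame(
  X = 1:3,
  fixed.acidity = c(7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28),
  citric.acid = c(0.36, 0.34, 0.40),
  residual.sugar = c(20.7, 1.6, 6.9),
  chlorides = c(0.045, 0.049, 0.050),
  free.sulfur.dioxide = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97),
  density = c(1.001, 0.994, 0.995),
  pH = c(3, 3.3, 3.26),
  sulphates = c(0.45, 0.49, 0.44),
  alcohol = c(8.8, 9.5, 10.1),
  quality = c(6L, 6L, 6L)
)

# Stand-in for the real file. With the actual data this would be something
# like read.csv("wineQualityWhites.csv") (assumed file name).
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path, row.names = FALSE)

wine <- read.csv(csv_path)
names(wine)  # variable names
str(wine)    # type and first values of each variable
```

With only three whole-number rows, `read.csv` will infer some columns as `int` that are `num` in the full dataset.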
We can see that there is an X variable, which is just the index of each wine. Since there are no missing values in this dataset, I simply show the summary of each variable below.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
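The missing-value check and the summary above could be produced along these lines. This is a hedged sketch: the small `wine` data frame is a stand-in built from the first five values of a few columns in the `str()` output, and `na_counts` and `wine_vars` are names introduced here for illustration.

```r
# Stand-in for the wine data frame: first five values of a few columns,
# copied from the str() output earlier in the report.
wine <- data.frame(
  X = 1:5,
  pH = c(3, 3.3, 3.26, 3.19, 3.19),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = c(6L, 6L, 6L, 6L, 6L)
)

# Count missing values per column; all zeros means nothing is missing.
na_counts <- colSums(is.na(wine))
print(na_counts)

# Drop the index column X before summarising, since it is only a row label.
wine_vars <- wine[, setdiff(names(wine), "X")]
summary(wine_vars)
```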
In this section, I will plot several histograms to explore how the wines are distributed across different variables.
First let’s take a look at the ratings of the wines.
We can see that the ratings are roughly bell-shaped and centered at 6, and that most wines are rated 5 or 6.
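Next, let’s take a look at alcohol. The number of wines decreases as the alcohol percentage increases. Wines with about 9% alcohol are the most common, and the distribution is right-skewed (the tail extends toward higher values), which is consistent with the mean (10.51) being above the median (10.40) in the summary.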
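The direction of skew can also be checked numerically. Below is a minimal sketch of a moment-based skewness function; `skewness` is a helper defined here, not a base R function, and the ten alcohol values are the ones printed in the `str()` output, so they are not representative of all 4898 wines. With the full data, the call would be `skewness(wine$alcohol)`, assuming the data frame is named `wine`.

```r
# Moment-based sample skewness.
# Positive values indicate a longer right tail, negative a longer left tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten alcohol values, copied from the str() output earlier in the report.
alcohol_sample <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

alcohol_skew <- skewness(alcohol_sample)
print(alcohol_skew)
```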
Let’s take a look at fixed acidity. We can see that most wines have a fixed acidity between 6 and 8 g/dm^3.
The above histogram shows the counts of wines by total sulfur dioxide. Most wines have total sulfur dioxide between 100 and 200 mg/dm^3.
This histogram shows the counts of wines by pH. Most wines have a pH between 3.0 and 3.3.
This histogram shows the counts of wines by residual sugar. The largest concentration of wines is under 2.5 g/dm^3, although the summary above gives a median of 5.2 g/dm^3, so the distribution has a long right tail.
Last, let’s plot the histograms for every variable in the data in a single figure.
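One way to draw all the histograms in a single figure is a base R grid layout, sketched below. This is an illustration rather than the code behind the figure in this report: `plot_all_histograms` is a hypothetical helper, the output is written to a temporary PDF, and the stand-in data uses the first ten values of three columns from the `str()` output.

```r
# Draw one histogram per numeric column on a single page and return the
# names of the columns that were plotted.
plot_all_histograms <- function(df, file) {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  n_cols <- ceiling(sqrt(length(numeric_cols)))
  n_rows <- ceiling(length(numeric_cols) / n_cols)

  pdf(file)                       # write the figure to a PDF file
  on.exit(dev.off())              # close the device even if plotting fails
  par(mfrow = c(n_rows, n_cols))  # grid layout, filled row by row
  for (col in numeric_cols) {
    hist(df[[col]], main = col, xlab = col, ylab = "Number of wines")
  }
  invisible(numeric_cols)
}

# Stand-in data: first ten values of three columns from the str() output.
wine <- data.frame(
  pH = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)

out_file <- tempfile(fileext = ".pdf")
plotted <- plot_all_histograms(wine, out_file)
```

On the full dataset it would make sense to drop the index column X first, since a histogram of row indices is flat and uninformative.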
There are 4898 observations and 13 variables in this dataset. Among the variables, X is the index of each wine and quality is the rating of each wine; both are stored as int. Quality is the output variable. The other 11 variables are properties of the wines and are stored as num (floating-point) values.
In this dataset, I’m interested in the relations between pH, alcohol and quality. I would like to explore if there is any strong relationship between them.
Density, volatile acidity and free sulfur dioxide may also support my investigation.
I haven’t created any new variables so far, since I’m not familiar with all of the chemicals. For many of them, it is unclear what counts as a high or low value.
Some variables are right-skewed and others are approximately normal. I did not notice any other unusual distributions in the dataset.
In this part, let’s take a look at some bivariate plots for the variables I am interested in.
The above graph is the scatter plot of pH vs. alcohol. It does not show any strong relationship between pH and alcohol.
The above graph is the scatter plot of residual.sugar vs. pH. It does not show any strong relationship between residual.sugar and pH.
Above is the scatter plot of volatile.acidity vs. pH. My assumption was that volatile acidity would affect pH, but the scatter plot does not show a strong relationship between the two.
The above plot is total.sulfur.dioxide vs. density. We can see that density tends to increase with total sulfur dioxide.
Let’s take a look at alcohol vs. density. We can see that density tends to drop as alcohol increases.
The plot of pH vs. density shows no strong relationship between pH and density.
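The visual impressions from the scatter plots above can be backed by correlation coefficients. The sketch below computes Pearson's r for a few of the pairs. `pair_correlations` is a hypothetical helper, and the stand-in data is the first ten rows from the `str()` output, so the resulting values say nothing reliable about the full dataset. Density is left out because only five of its values are printed there; with the real data frame, pairs such as `c("alcohol", "density")` could be added to the list.

```r
# Pearson correlation for each named pair of columns.
pair_correlations <- function(df, pairs) {
  sapply(pairs, function(p) cor(df[[p[1]]], df[[p[2]]]))
}

# Stand-in data: first ten values of each column from the str() output.
wine <- data.frame(
  pH = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22)
)

pairs <- list(
  "pH vs. alcohol" = c("pH", "alcohol"),
  "residual.sugar vs. pH" = c("residual.sugar", "pH"),
  "volatile.acidity vs. pH" = c("volatile.acidity", "pH")
)

r <- pair_correlations(wine, pairs)
print(round(r, 2))
```

Values of r near 0 correspond to the "no strong relationship" readings, while values closer to -1 or 1 indicate a stronger linear trend.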